This homework is longer than the other ones - take time to get through it all!
Work through each section and provide a well-commented response in one or more Python cells. I should be able to execute the notebook and obtain the same results that you show in each cell (do not clear the cell outputs before uploading your final notebook to Git!).
Part of your grade (see rubric) is based on preparing a logical notebook that is easy to follow according to programming guidelines from Dave's Monday (2/13) lecture. I won't be a stickler, but don't give me messy code please.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
import math
from scipy import stats
%matplotlib inline
Explain in your own words the following concepts:
Note: I expect you not to simply read these definitions in ISL and modestly change the words. Please take a few minutes to read multiple definitions and come up with your own definition.
Look up the 'Default' dataset features in the ISL textbook. See page 130 of the book and online resources for more information. This dataset should be easy to find.
Implement both the linear and logistic regression models using sklearn, calculate the coefficients $\beta_0$ and $\beta_1$ and exactly (to the best of your ability) reproduce Figure 4.2 in the book.
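For reference, the two fits being compared are the linear probability model $p(X) = \beta_0 + \beta_1 X$ and the logistic model of ISL Eq. 4.2, $p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$, where $X$ is the credit card balance and $p(X)$ is the modeled probability of default.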
In [2]:
data = pd.read_csv('Default.csv')
data = data.drop('Unnamed: 0',axis = 1)
#change Yes/No to 1/0 (factorize assigns 0 to the first value it encounters, which is 'No' in this dataset, so No -> 0 and Yes -> 1)
data['def_chg'] = data.default.factorize()[0]
data.head()
Out[2]:
In [3]:
#linear regression of default (0/1) on balance
regr = linear_model.LinearRegression()
regr.fit(data.balance.values.reshape(-1,1), data.def_chg)
#logistic regression of default (0/1) on balance
#(note: sklearn's LogisticRegression applies L2 regularization by default (C=1.0);
# with 10,000 observations the effect is small, but increasing C gives a closer match
# to the unregularized fit reported in ISL)
log_org = linear_model.LogisticRegression(solver='newton-cg')
log_org.fit(data.balance.values.reshape(-1,1), data.def_chg)
x_array = np.arange(data.balance.min(),data.balance.max())
plt.figure(figsize=(10,5))
#left
plt.subplot(121)
plt.scatter(data.balance,data.def_chg)
plt.plot(data.balance, regr.predict(data.balance.values.reshape(-1,1)), color = 'g')
plt.plot([-100,3000],[0,0],linestyle = 'dashed', color = 'k')
plt.plot([-100,3000],[1,1],linestyle = 'dashed', color = 'k')
plt.xlim([-140,3000])
plt.xlabel('Balance')
plt.ylabel('Probability of default')
#right
plt.subplot(122)
plt.scatter(data.balance,data.def_chg)
plt.plot(x_array, log_org.predict_proba(x_array.reshape(-1,1))[:,1], color = 'g')
plt.plot([-100,3000],[0,0],linestyle = 'dashed', color = 'k')
plt.plot([-100,3000],[1,1],linestyle = 'dashed', color = 'k')
plt.xlim([-140,3000])
plt.xlabel('Balance')
plt.ylabel('Probability of default')
Out[3]:
In [4]:
print('B0, B1 for linear: ', regr.intercept_, regr.coef_[0])
print('B0, B1 for logistic: ', log_org.intercept_[0], log_org.coef_[0][0])
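As a quick sanity check on the logistic coefficients printed above (a minimal sketch; the balance value of 1500 is an arbitrary illustration, not part of the assignment), plugging $\beta_0$ and $\beta_1$ into the logistic function should reproduce predict_proba:
#sanity check: evaluate p(default) by hand at an example balance of 1500
b0, b1 = log_org.intercept_[0], log_org.coef_[0][0]
balance = 1500
p_manual = np.exp(b0 + b1*balance)/(1 + np.exp(b0 + b1*balance))
p_sklearn = log_org.predict_proba(np.array([[balance]]))[0, 1]
print('manual probability at balance=1500: ', p_manual)
print('sklearn probability at balance=1500:', p_sklearn)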
Make sure you understand all the components of the SLR model Jim showed in class and code the same thing yourself
Generate a separate set of training data (25 points) and validation data (15 points). Each data set should have the same amount of irreducible error (random noise term) applied to it.
Make a plot of the error in the testing data as a function of the random strength. There is no "one answer" for this part; it is open ended and graded based on creativity and quality of results.
For the exact example I showed in class, calculate the RSE, SSR, $R^2$, 95% confidence intervals for both $\beta_i$, and evaluate the P-value at 0.05 significance level for $\beta_1$
Explain in plain language the meaning of the P-value test
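For reference, the quantities computed in the next two cells follow the standard simple-linear-regression formulas from ISL Chapter 3: $\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$, $\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n-2)}$, $\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$, $R^2 = (\mathrm{TSS} - \mathrm{RSS})/\mathrm{TSS}$, $\mathrm{SE}(\hat{\beta}_1)^2 = \mathrm{RSE}^2/\sum_i (x_i - \bar{x})^2$, $\mathrm{SE}(\hat{\beta}_0)^2 = \mathrm{RSE}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}\right]$, and the test of $H_0\!:\beta_1 = 0$ uses $t = \hat{\beta}_1/\mathrm{SE}(\hat{\beta}_1)$ with $n-2$ degrees of freedom.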
In [5]:
#model: Y = 3X + 4
#Generate a separate set of training data (25 points) and validation data (15 points).
pts=25
x_tra = np.linspace(-50,50,num = pts)
x_val = np.linspace(-50,50,num = 15)
B0=4
B1=3
y_act = B0 + B1*x_tra
y_val = B0 + B1*x_val
np.random.seed(123)
#add noise scaled to 25% of range to data
yrand = y_act + .25*(y_act.max()-y_act.min())*np.random.normal(size = pts)
yrand_val = y_val + .25*(y_val.max()-y_val.min())*np.random.normal(size = 15)
plt.figure(figsize=(5,15))
plt.subplot(311)
plt.scatter(x_tra,yrand)
plt.title('Training data (25 points)')
plt.ylabel('y')
plt.xlabel('x')
plt.grid(linestyle = 'dashed')
plt.subplot(312)
plt.scatter(x_val,yrand_val)
plt.title('Validation data (15 points)')
plt.ylabel('y')
plt.xlabel('x')
plt.grid(linestyle = 'dashed')
plt.subplot(313)
#Use a for loop to test adding different percentage strengths of noise to the validation data
for i in range(100):
    yrand_val = y_val + (i/100)*(y_val.max()-y_val.min())*np.random.normal(size = 15)
    #calculate RSE of the noisy validation data relative to the true model
    plt.plot(i/100, math.sqrt((((y_val-yrand_val)**2).sum())/(15-2)), marker='o', color = 'b')
plt.title('Error in testing data as a function of random strength')
plt.ylabel('RSE')
plt.xlabel('Percent Error in Testing Data')
plt.grid(linestyle = 'dashed')
In [6]:
#exact example showed in class
#model: Y = 3X + 4
#size of training data and scale of random noise
pts=25
noisescale=.25
x=np.linspace(-50,50,num=pts)
B0=4
B1=3
yactual=B0+B1*x
np.random.seed(123)
#add noise scaled to 25% of range to data
yrand=yactual+noisescale*(yactual.max()-yactual.min())*np.random.normal(size=pts)
#SLR
regr=linear_model.LinearRegression()
regr.fit(x.reshape(-1,1),yrand)
print('B0, B1: ',regr.intercept_, regr.coef_[0])
ypred = regr.predict(x.reshape(-1,1))
RSS = ((yrand - ypred)**2).sum()
print('RSS =', RSS)
RSE = math.sqrt(RSS/(pts - 2))
print ('RSE =', RSE)
TSS = ((yrand - yrand.mean())**2).sum()
Rsqr = (TSS-RSS)/TSS
print ('R^2 =', Rsqr)
SE_beta0 = math.sqrt((RSE**2)*(1/pts + x.mean()**2/((x - x.mean())**2).sum()))
SE_beta1 = math.sqrt(RSE**2 / ((x - x.mean())**2).sum())
print('95% confidence interval for B0:',[regr.intercept_ - 2*SE_beta0, regr.intercept_ + 2*SE_beta0])
print('95% confidence interval for B1:',[regr.coef_[0] - 2*SE_beta1,regr.coef_[0] + 2*SE_beta1])
t = (regr.coef_[0] - 0)/SE_beta1
p = stats.t.sf(np.abs(t), pts-2)*2
print('p value of B1:', p)
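As an optional cross-check of the hand-computed statistics above (a minimal sketch, assuming a reasonably recent scipy, which is already imported at the top of the notebook), scipy.stats.linregress returns the slope, intercept, $R^2$ (as the squared rvalue), the two-sided p-value for the slope, and the standard error of the slope:
#cross-check the hand-computed values against scipy's built-in simple linear regression
res = stats.linregress(x, yrand)
print('slope, intercept:', res.slope, res.intercept)
print('R^2:            ', res.rvalue**2)
print('p value of B1:  ', res.pvalue)
print('SE of B1:       ', res.stderr)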
In [7]:
harvard = pd.read_csv('HCEPD_100K.csv')
regr2 = linear_model.LinearRegression()
regr2.fit(harvard[['mass','voc','e_lumo_alpha']],harvard.pce)
print(regr2.coef_)
print(regr2.intercept_)
# generate matrix X to make predictions of PCE over the parameter space
# (all three features are swept from min to max simultaneously, so each red line below
#  shows the prediction along that joint sweep rather than a one-feature-at-a-time effect)
pts=100
X=np.zeros((pts,3))
X[:,0]=np.linspace(harvard.mass.min(),harvard.mass.max(),pts)
X[:,1]=np.linspace(harvard.voc.min(),harvard.voc.max(),pts)
X[:,2]=np.linspace(harvard.e_lumo_alpha.min(),harvard.e_lumo_alpha.max(),pts)
# plot the predicted data
plt.figure(figsize=(5,10))
plt.subplot(311)
plt.scatter(harvard.mass,harvard.pce)
plt.plot(X[:,0],regr2.predict(X),color='red',lw=3)
plt.ylabel('PCE')
plt.xlabel('$mass$')
plt.subplot(312)
plt.scatter(harvard.voc,harvard.pce)
plt.plot(X[:,1],regr2.predict(X),color='red',lw=3)
plt.ylabel('PCE')
plt.xlabel('$VOC$')
plt.subplot(313)
plt.scatter(harvard.e_lumo_alpha,harvard.pce)
plt.plot(X[:,2],regr2.predict(X),color='red',lw=3)
plt.ylabel('PCE')
plt.xlabel('$E_{LUMO}$')
Out[7]:
In [8]:
#TSS
TSS = 0
pce_mean = harvard.pce.mean()
for i in harvard.pce:
    TSS += (i - pce_mean)**2
print('TSS = ', TSS)
#RSS
RSS = 0
for i in range(harvard.shape[0]):
    RSS += (harvard.pce[i] - regr2.intercept_ - regr2.coef_[0]*harvard.mass[i]
            - regr2.coef_[1]*harvard.voc[i] - regr2.coef_[2]*harvard.e_lumo_alpha[i])**2
print('RSS = ', RSS)
p = 3
n = harvard.shape[0]
F = ((TSS - RSS)/p)/(RSS/(n-p-1))
print('F-statistic = ', F)
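The Python loops above give the right answer but are slow on the 100,000-row dataset; as an optional cross-check (a minimal sketch reusing the regr2 fit from the earlier cell), the same TSS and RSS can be computed with vectorized pandas operations:
#vectorized cross-check of TSS and RSS
pce_hat = regr2.predict(harvard[['mass','voc','e_lumo_alpha']])
TSS_vec = ((harvard.pce - harvard.pce.mean())**2).sum()
RSS_vec = ((harvard.pce - pce_hat)**2).sum()
print('TSS (vectorized) =', TSS_vec)
print('RSS (vectorized) =', RSS_vec)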
In [9]:
#Partial F-tests: drop one predictor at a time and compare against the full-model RSS.
#Note: RSS0 below reuses the full-model coefficients rather than refitting the reduced
#model, so these are approximate partial F-tests.
#beta 1 (mass)
RSS0 = 0
for i in range(n):
    RSS0 += (harvard.pce[i] - regr2.intercept_ - regr2.coef_[1]*harvard.voc[i]
             - regr2.coef_[2]*harvard.e_lumo_alpha[i])**2
F = (RSS0 - RSS)/(RSS/(n - p - 1))
pval = stats.f.sf(F, 1, n - p - 1)   #F-distribution with (1, n-p-1) degrees of freedom
print("p value of beta1:", pval, "(note: 0.0 means < 1e-6)")
#beta 2 (voc)
RSS0 = 0
for i in range(n):
    RSS0 += (harvard.pce[i] - regr2.intercept_ - regr2.coef_[0]*harvard.mass[i]
             - regr2.coef_[2]*harvard.e_lumo_alpha[i])**2
F = (RSS0 - RSS)/(RSS/(n - p - 1))
pval = stats.f.sf(F, 1, n - p - 1)
print("p value of beta2:", pval, "(note: 0.0 means < 1e-6)")
#beta 3 (e_lumo_alpha)
RSS0 = 0
for i in range(n):
    RSS0 += (harvard.pce[i] - regr2.intercept_ - regr2.coef_[0]*harvard.mass[i]
             - regr2.coef_[1]*harvard.voc[i])**2
F = (RSS0 - RSS)/(RSS/(n - p - 1))
pval = stats.f.sf(F, 1, n - p - 1)
print("p value of beta3:", pval, "(note: 0.0 means < 1e-6)")
I would like you to write a clearly labeled function that performs bootstrapping using a subset of the HCEPD_100K.csv data. Your function should be versatile in terms of how much of the data is selected for the bootstrap and how many iterations are run.
Prepare a plot of MSE using the same 3-parameter PCE fit (mass/VOC/lumo features). The x-axis should be the number of bootstrap samples and the y-axis should be the MSE. You should show a line for bootstrap sample sizes of 100, 1000, and 5000.
For the same sampling you should show a plot of the estimates of the three relevant beta coefficients.
In [11]:
#Create a bootstrap sampling function that returns the average MSE and beta estimates
def bootstrap(data, num, iter_):
    """Bootstrap the 3-parameter PCE fit.

    data: DataFrame with mass, voc, e_lumo_alpha and pce columns
    num: number of rows drawn (with replacement) per bootstrap sample
    iter_: number of bootstrap iterations to average over
    """
    MSE_avg = 0
    B1_avg = 0
    B2_avg = 0
    B3_avg = 0
    for i in range(iter_):
        #choose rows randomly with replacement
        df = data.sample(num, replace = True)
        #MLR on the bootstrap sample
        regr = linear_model.LinearRegression()
        regr.fit(df[['mass','voc','e_lumo_alpha']], df.pce)
        pred = regr.predict(df[['mass','voc','e_lumo_alpha']])
        #accumulate MSE and beta estimates
        MSE = ((df.pce - pred)**2).mean()
        MSE_avg += MSE
        B1_avg += regr.coef_[0]
        B2_avg += regr.coef_[1]
        B3_avg += regr.coef_[2]
    return [MSE_avg/iter_, B1_avg/iter_, B2_avg/iter_, B3_avg/iter_]
#collect these values to plot
MSE_list = []
B1_list = []
B2_list = []
B3_list = []
for i in range(1,51):
    MSE_sublist = []
    B1_sublist = []
    B2_sublist = []
    B3_sublist = []
    for j in [100,1000,5000]:
        bst = bootstrap(harvard, j, i)
        MSE_sublist.append(bst[0])
        B1_sublist.append(bst[1])
        B2_sublist.append(bst[2])
        B3_sublist.append(bst[3])
    MSE_list.append(MSE_sublist)
    B1_list.append(B1_sublist)
    B2_list.append(B2_sublist)
    B3_list.append(B3_sublist)
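The same resampling loop can also be used to estimate bootstrap standard errors of the coefficients by keeping each replicate's estimates instead of only their mean; a minimal sketch (the 1000-row sample size and 50 replicates are illustrative choices, not specified in the assignment):
#bootstrap standard errors of the three coefficients (illustrative settings)
coef_reps = []
for _ in range(50):
    df_b = harvard.sample(1000, replace=True)
    regr_b = linear_model.LinearRegression()
    regr_b.fit(df_b[['mass','voc','e_lumo_alpha']], df_b.pce)
    coef_reps.append(regr_b.coef_)
coef_reps = np.asarray(coef_reps)
print('bootstrap SE of B1, B2, B3:', coef_reps.std(axis=0))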
In [12]:
#MLR on original data set
x = range(1,51)
regr = linear_model.LinearRegression()
regr.fit(harvard[['mass','voc','e_lumo_alpha']],harvard.pce)
pred = regr.predict(harvard[['mass','voc','e_lumo_alpha']])
MSE = ((harvard.pce - pred)**2).mean()
#MSE
plt.figure(figsize=(5,4))
plt.subplot(111)
plt.plot(x, np.asarray(MSE_list)[:,0], label='size = 100')
plt.plot(x, np.asarray(MSE_list)[:,1], label='size = 1000')
plt.plot(x, np.asarray(MSE_list)[:,2], label='size = 5000')
plt.plot([-2,52],[MSE, MSE],color = 'k', linestyle = 'dashed', label='MLR on original data')
plt.xlim([-2,52])
plt.xlabel('Number of bootstrap samples ')
plt.ylabel('MSE')
plt.legend()
plt.grid()
#beta1
plt.figure(figsize=(16,4))
plt.subplot(131)
plt.plot(x, np.asarray(B1_list)[:,0], label='size = 100')
plt.plot(x, np.asarray(B1_list)[:,1], label='size = 1000')
plt.plot(x, np.asarray(B1_list)[:,2], label='size = 5000')
plt.plot([-2,52],[regr.coef_[0],regr.coef_[0]],color = 'k', linestyle = 'dashed', label='MLR on original data')
plt.xlim([-2,52])
plt.ylabel(chr(946)+'1')
plt.xlabel('Number of bootstrap samples ')
plt.legend()
plt.grid()
#beta2
plt.subplot(132)
plt.plot(x, np.asarray(B2_list)[:,0], label='size = 100')
plt.plot(x, np.asarray(B2_list)[:,1], label='size = 1000')
plt.plot(x, np.asarray(B2_list)[:,2], label='size = 5000')
plt.plot([-2,52],[regr.coef_[1],regr.coef_[1]],color = 'k', linestyle = 'dashed', label='MLR on original data')
plt.xlim([-2,52])
plt.ylabel(chr(946)+'2')
plt.xlabel('Number of bootstrap samples ')
plt.legend()
plt.grid()
#beta3
plt.subplot(133)
plt.plot(x, np.asarray(B3_list)[:,0], label='size = 100')
plt.plot(x, np.asarray(B3_list)[:,1], label='size = 1000')
plt.plot(x, np.asarray(B3_list)[:,2], label='size = 5000')
plt.plot([-2,52],[regr.coef_[2],regr.coef_[2]],color = 'k', linestyle = 'dashed', label='MLR on original data')
plt.xlim([-2,52])
plt.ylabel(chr(946)+'3')
plt.xlabel('Number of bootstrap samples ')
plt.legend()
plt.grid()